
Conversation

@andrewor14
Contributor

This is a step towards consolidating SQLContext and HiveContext.

This patch extends the existing Catalog API added in #10982 with methods for handling table partitions. A partition is identified by a `PartitionSpec`, which is simply a `Map[String, String]` from partition column names to values. Nothing uses the `Catalog` yet, but its API is now more or less complete, and an in-memory implementation is fully tested.

About 200 of the added lines are test code.
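As a rough illustration of the `PartitionSpec` alias described above (the type alias is from the patch; the surrounding object and the path-style rendering are a hypothetical sketch, not the merged API):

```scala
object PartitionSpecExample {
  // From the patch: a partition is identified by a map from
  // partition column names to their string values.
  type PartitionSpec = Map[String, String]

  def main(args: Array[String]): Unit = {
    val spec: PartitionSpec = Map("year" -> "2016", "month" -> "02")
    // Render the spec in the familiar path-like form used by Hive-style layouts.
    println(spec.map { case (col, value) => s"$col=$value" }.mkString("/"))
  }
}
```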

Andrew Or added 4 commits February 3, 2016 14:28
These are a subset of the public interfaces exposed by Hive. This commit just adds the skeleton without implementing any of them.
@andrewor14 andrewor14 changed the title [SPARK-13079] Extend Catalog API + implement InMemoryCatalog [SPARK-13079] [SQL] Extend and implement InMemoryCatalog Feb 4, 2016
@andrewor14
Contributor Author

retest this please



object Catalog {
type PartitionSpec = Map[String, String]
Contributor
Need to document that this is a mapping from column names to values.
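One way to address this comment (a sketch of possible Scaladoc for the alias, not the exact wording that was merged):

```scala
object Catalog {
  /**
   * Identifies a table partition as a mapping from partition column
   * names to their values, e.g. Map("a" -> "1", "b" -> "2") for a
   * table partitioned by columns a and b.
   */
  type PartitionSpec = Map[String, String]
}
```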

@rxin
Contributor

rxin commented Feb 4, 2016

LGTM otherwise. Feel free to merge this one and address issues in your next pull request.

@SparkQA

SparkQA commented Feb 4, 2016

Test build #50728 has finished for PR 11069 at commit 1b40002.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin
Contributor

rxin commented Feb 4, 2016

I'm going to merge this.

@asfgit asfgit closed this in a648311 Feb 4, 2016
@andrewor14 andrewor14 deleted the catalog branch February 4, 2016 03:50
asfgit pushed a commit that referenced this pull request Feb 4, 2016
This patch incorporates review feedback from #11069, which is already merged.

Author: Andrew Or <andrew@databricks.com>

Closes #11080 from andrewor14/catalog-follow-ups.
asfgit pushed a commit that referenced this pull request Feb 21, 2016
## What changes were proposed in this pull request?

This is a step towards merging `SQLContext` and `HiveContext`. A new internal Catalog API was introduced in #10982 and extended in #11069. This patch introduces an implementation of this API using `HiveClient`, an existing interface to Hive. It also extends `HiveClient` with additional calls to Hive that are needed to complete the catalog implementation.

*Where should I start reviewing?* The new catalog introduced is `HiveCatalog`. This class is relatively simple because it just calls `HiveClientImpl`, where most of the new logic is. I would not start with `HiveClient`, `HiveQl`, or `HiveMetastoreCatalog`, which are modified mainly because of a refactor.

*Why is this patch so big?* I had to refactor `HiveClient` to remove an intermediate representation of databases, tables, partitions, etc. After this refactor, `CatalogTable` converts directly to and from `HiveTable` (and likewise for the other entities). Otherwise we would have to first convert `CatalogTable` to the intermediate representation and then convert that to `HiveTable`, which is messy.
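The shape of that refactor can be sketched as follows; the case classes and field names here are illustrative assumptions, not the actual `CatalogTable`/`HiveTable` definitions:

```scala
// Illustrative stand-ins for the real classes (fields are assumed).
case class CatalogTable(name: String, properties: Map[String, String])
case class HiveTable(name: String, properties: Map[String, String])

object TableConversions {
  // Direct conversion in each direction, with no intermediate
  // representation in between.
  def toHiveTable(t: CatalogTable): HiveTable =
    HiveTable(t.name, t.properties)

  def fromHiveTable(h: HiveTable): CatalogTable =
    CatalogTable(h.name, h.properties)
}
```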

The new class hierarchy is as follows:
```
org.apache.spark.sql.catalyst.catalog.Catalog
  - org.apache.spark.sql.catalyst.catalog.InMemoryCatalog
  - org.apache.spark.sql.hive.HiveCatalog
```
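A minimal sketch of that hierarchy (the class names come from the patch; the two methods shown are assumptions for illustration — the real interface covers databases, tables, partitions, and functions):

```scala
import scala.collection.mutable

trait Catalog {
  def createDatabase(name: String): Unit
  def listDatabases(): Seq[String]
}

// Pure in-memory implementation, convenient for tests.
class InMemoryCatalog extends Catalog {
  private val databases = mutable.LinkedHashSet[String]("default")
  override def createDatabase(name: String): Unit = databases += name
  override def listDatabases(): Seq[String] = databases.toSeq.sorted
}

// A HiveCatalog would implement the same trait by delegating
// its calls to HiveClientImpl (omitted here).
```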

Note that, as of this patch, none of these classes is used anywhere yet. That will come in a future patch before the Spark 2.0 release.

## How was this patch tested?
All existing unit tests, plus a new `HiveCatalogSuite` that extends `CatalogTestCases`.

Author: Andrew Or <andrew@databricks.com>
Author: Reynold Xin <rxin@databricks.com>

Closes #11293 from rxin/hive-catalog.